Alberto Cano, Ph.D.

Associate Professor

  • Richmond, VA, United States
  • Engineering Research Building, Room 2314
  • acano@vcu.edu

Dr. Cano specializes in machine learning, data mining, classification, big data, data streams, and high-performance computing.

Biography

Alberto Cano is an Associate Professor in the Department of Computer Science at Virginia Commonwealth University, Richmond, Virginia, United States, where he heads the High-Performance Data Mining laboratory. His research focuses on machine learning, big data, data streams, concept drift, continual learning, GPUs, and distributed computing. He is also the Faculty Director of the High Performance Research Computing Core Facility at VCU: https://hprc.vcu.edu/

Areas of Expertise

Machine Learning
Data Mining
Classification
Big Data
High Performance Computing

Accomplishments

Top 2% of most-cited researchers in the field of AI, per the Stanford University ranking

2022-12-01

Stanford University Scientist Rankings

Amazon Machine Learning Award

2018-08-01

Hate Speech Detection on Amazon Reviews using Data Stream Mining on Spark and AWS

Education

University of Granada, Spain

Ph.D.

Computer Science

2014

University of Cordoba, Spain

M.Sc.

Intelligent Systems

2013

University of Granada, Spain

M.Sc.

Soft Computing and Intelligent Systems

2011


Research Grants

MRI: Track 1 Acquisition of NVIDIA DGX H100 GPU system for research and education at VCU

National Science Foundation

2023-09-06

NSF MRI


SentimentVoice: Integrating emotion AI and VR in Performing Arts

Commonwealth Cyber Initiative

2023-06-01

Integrating emotion AI and VR in Performing Arts

HPRC research computing clusters

State Council of Higher Education for Virginia

2022-12-01

HPRC research computing clusters


Courses

CMSC 508 - Databases

Database Theory

CMSC 603 - High Performance Distributed Systems

High Performance Distributed Systems

Selected Articles

A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

Machine Learning

G. Aguiar, B. Krawczyk, and A. Cano

2023-06-01

Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create complete, trustworthy, and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.
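The evaluation protocol underlying such benchmarks is prequential (test-then-train) assessment: each arriving instance is scored before the model learns from it, so every prediction is out-of-sample. The following minimal Python sketch, with an invented majority-class baseline (it is not the authors' experimental framework), illustrates the protocol and why plain accuracy misleads under class imbalance:

```python
from collections import Counter

class MajorityClass:
    """Toy baseline learner: always predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential(stream, model):
    """Test-then-train evaluation: score each instance, then learn from it."""
    hits = Counter()   # per-class correct predictions
    seen = Counter()   # per-class instance counts
    for x, y in stream:
        if model.predict(x) == y:
            hits[y] += 1
        seen[y] += 1
        model.learn(x, y)
    accuracy = sum(hits.values()) / sum(seen.values())
    recall = {y: hits[y] / seen[y] for y in seen}  # per-class recall
    return accuracy, recall

# A 95:5 imbalanced stream: high accuracy, zero minority-class recall.
acc, rec = prequential([((0,), 0)] * 95 + [((1,), 1)] * 5, MajorityClass())
```

On this stream the baseline reaches 94% accuracy while never predicting the minority class, which is why imbalanced-stream studies report per-class and skew-insensitive metrics rather than accuracy alone.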


ROSE: Robust Online Self-Adjusting Ensemble for Continual Learning on Imbalanced Drifting Data Streams

Machine Learning

A. Cano and B. Krawczyk

2022-11-01

Data streams are potentially unbounded sequences of instances arriving over time to a classifier. Designing algorithms that are capable of dealing with massive, rapidly arriving information is one of the most dynamically developing areas of machine learning. Such learners must be able to deal with a phenomenon known as concept drift, where the data stream may be subject to various changes in its characteristics over time. Furthermore, distributions of classes may evolve over time, leading to a highly difficult non-stationary class imbalance. In this work we introduce Robust Online Self-Adjusting Ensemble (ROSE), a novel online ensemble classifier capable of dealing with all of the mentioned challenges. The main features of ROSE are: (1) online training of base classifiers on variable size random subsets of features; (2) online detection of concept drift and creation of a background ensemble for faster adaptation to changes; (3) sliding window per class to create skew-insensitive classifiers regardless of the current imbalance ratio; and (4) self-adjusting bagging to enhance the exposure of difficult instances from minority classes. The interplay among these features leads to an improved performance in various data stream mining benchmarks. An extensive experimental study comparing with 30 ensemble classifiers shows that ROSE is a robust and well-rounded classifier for drifting imbalanced data streams, especially under the presence of noise and class imbalance drift, while maintaining competitive time complexity and memory consumption. Results are supported by a thorough non-parametric statistical analysis.
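Feature (3) above, the per-class sliding window, can be sketched in a few lines of Python. This is an illustrative reimplementation of the idea only, with invented class and parameter names, not the authors' ROSE code:

```python
from collections import defaultdict, deque

class PerClassWindow:
    """Keep the most recent `size` instances of each class separately, so
    minority-class examples are not crowded out of one shared buffer."""
    def __init__(self, size):
        # Each class gets its own fixed-length buffer; old instances fall out.
        self.buffers = defaultdict(lambda: deque(maxlen=size))

    def add(self, x, y):
        self.buffers[y].append(x)

    def training_set(self):
        # A snapshot whose class proportions are bounded by the window size,
        # regardless of the stream's current imbalance ratio.
        return [(x, y) for y, buf in self.buffers.items() for x in buf]

# A 95:5 stream still yields at most `size` instances per class for training.
w = PerClassWindow(size=10)
for i in range(95):
    w.add((i,), 0)
for i in range(5):
    w.add((i,), 1)
```

Training base classifiers on such snapshots is what makes them skew-insensitive: the majority class can never contribute more than the window size, while every minority instance seen recently is retained.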


Kappa Updated Ensemble for Drifting Data Stream Mining

Machine Learning

A. Cano and B. Krawczyk

2019-08-30

Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into account the potentially unbounded size of the data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream mining have gained significant popularity, due to their high predictive capabilities and effective mechanisms for alleviating concept drift. In this paper, we propose a new ensemble method named Kappa Updated Ensemble (KUE). It is a combination of online and block-based ensemble approaches that uses the Kappa statistic for dynamic weighting and selection of base classifiers. In order to achieve a higher diversity among base learners, each of them is trained using a different subset of features and updated with new instances with a probability following a Poisson distribution. Furthermore, we update the ensemble with new classifiers only when they contribute positively to the improvement of the quality of the ensemble. Finally, each base classifier in KUE is capable of abstaining from voting, thus increasing the overall robustness of KUE. An extensive experimental study shows that KUE is capable of outperforming state-of-the-art ensembles on standard and imbalanced drifting data streams while having a low computational complexity. Moreover, we analyze the use of Kappa versus accuracy as the criterion to select and update the classifiers, the contribution of the abstaining mechanism, the contribution of the diversification of classifiers, and the contribution of the hybrid architecture to update the classifiers in an online manner.
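The mechanics described above — per-member random feature subsets, Kappa-weighted voting, Poisson-distributed online updates, and abstention — can be sketched as follows. This is a toy illustration with an invented nearest-centroid base learner and invented names, not the published KUE implementation:

```python
import math
import random
from collections import defaultdict

class CentroidLearner:
    """Toy online base learner: per-class running mean over a random feature subset."""
    def __init__(self, n_features, rng):
        k = max(1, n_features // 2)
        self.features = rng.sample(range(n_features), k)  # diversity via feature subsets
        self.sums, self.counts = {}, {}

    def learn(self, x, y):
        s = self.sums.setdefault(y, [0.0] * len(self.features))
        for i, f in enumerate(self.features):
            s[i] += x[f]
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        if not self.counts:
            return None  # abstain until the learner has seen data
        def sqdist(y):
            c = self.counts[y]
            return sum((x[f] - self.sums[y][i] / c) ** 2
                       for i, f in enumerate(self.features))
        return min(self.counts, key=sqdist)

class KappaWeightedEnsemble:
    """Sketch of the core loop: Kappa-weighted voting, Poisson(1) updates, abstention."""
    def __init__(self, n_members, n_features, seed=0):
        self.rng = random.Random(seed)
        self.members = [CentroidLearner(n_features, self.rng) for _ in range(n_members)]
        self.conf = [defaultdict(int) for _ in self.members]  # (true, pred) counts

    def _poisson(self, lam=1.0):
        # Knuth's sampler, adequate for small lambda.
        L, k, p = math.exp(-lam), 0, 1.0
        while p > L:
            k += 1
            p *= self.rng.random()
        return k - 1

    def _kappa(self, i):
        cm = self.conf[i]
        n = sum(cm.values())
        if n == 0:
            return 0.0
        labels = {y for pair in cm for y in pair}
        p0 = sum(cm.get((y, y), 0) for y in labels) / n           # observed agreement
        pe = sum(sum(v for (t, _), v in cm.items() if t == y) *
                 sum(v for (_, p), v in cm.items() if p == y)
                 for y in labels) / (n * n)                        # chance agreement
        return (p0 - pe) / (1 - pe) if pe < 1 else 0.0

    def predict(self, x):
        votes = defaultdict(float)
        for i, m in enumerate(self.members):
            y = m.predict(x)
            if y is not None:                      # abstaining members cast no vote
                votes[y] += max(self._kappa(i), 0.0) + 1e-9
        return max(votes, key=votes.get) if votes else None

    def learn(self, x, y):
        for i, m in enumerate(self.members):
            pred = m.predict(x)
            if pred is not None:
                self.conf[i][(y, pred)] += 1       # prequential Kappa bookkeeping
            for _ in range(self._poisson()):        # online bagging-style update
                m.learn(x, y)

# Two well-separated classes: the ensemble learns them from the stream.
ens = KappaWeightedEnsemble(n_members=5, n_features=2, seed=42)
stream = [((0.0, 0.1), 0), ((5.0, 4.9), 1), ((0.1, -0.1), 0), ((4.8, 5.2), 1)] * 10
for x, y in stream:
    ens.learn(x, y)
```

Weighting votes by Kappa rather than accuracy is what keeps the ensemble honest on imbalanced streams: a member that merely echoes the majority class earns a Kappa near zero and loses its influence on the vote.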

